## [1] 113937 81
The raw data has 81 variables, which is too many for this project. Below is a smaller set of variables that should be more useful for analysing the loan data.
The details of the stripped-down data set are as follows:
## [1] 113937 19
## [1] "Term" "LoanStatus"
## [3] "ClosedDate" "ListingCategory..numeric."
## [5] "BorrowerState" "Occupation"
## [7] "IncomeRange" "IncomeVerifiable"
## [9] "StatedMonthlyIncome" "CreditScoreRangeLower"
## [11] "ProsperScore" "EmploymentStatus"
## [13] "EmploymentStatusDuration" "CurrentCreditLines"
## [15] "TotalCreditLinespast7years" "DebtToIncomeRatio"
## [17] "BorrowerRate" "LoanOriginalAmount"
## [19] "LoanOriginationDate"
## 'data.frame': 113937 obs. of 19 variables:
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ TotalCreditLinespast7years: int 12 29 3 29 49 49 20 10 32 32 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
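The column selection behind this 19-variable subset can be sketched as follows. This is a minimal illustration on a tiny stand-in data frame, not the original code: the object names `loans`, `keep` and `loan_subset`, the toy rows, and the `UnusedColumn` column are all assumptions made for the example.

```r
# Toy stand-in for the full 81-column data set (values are illustrative only).
loans <- data.frame(
  Term = c(36L, 36L, 60L),
  LoanStatus = c("Completed", "Current", "Current"),
  BorrowerRate = c(0.158, 0.092, 0.275),
  UnusedColumn = c("a", "b", "c"),  # hypothetical column that gets dropped
  stringsAsFactors = TRUE
)

# Names of the columns to keep (the report keeps 19; three are shown here).
keep <- c("Term", "LoanStatus", "BorrowerRate")

# Fail early if a requested column is missing, e.g. because of a typo.
missing_cols <- setdiff(keep, names(loans))
stopifnot(length(missing_cols) == 0)

# drop = FALSE keeps the result a data frame even for a single column.
loan_subset <- loans[, keep, drop = FALSE]
dim(loan_subset)
str(loan_subset)
```

Selecting by name instead of by position keeps the subset stable if the column order of the source file changes.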
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
## [1] ""
## [2] "Accountant/CPA"
## [3] "Administrative Assistant"
## [4] "Analyst"
## [5] "Architect"
## [6] "Attorney"
## [7] "Biologist"
## [8] "Bus Driver"
## [9] "Car Dealer"
## [10] "Chemist"
## [11] "Civil Service"
## [12] "Clergy"
## [13] "Clerical"
## [14] "Computer Programmer"
## [15] "Construction"
## [16] "Dentist"
## [17] "Doctor"
## [18] "Engineer - Chemical"
## [19] "Engineer - Electrical"
## [20] "Engineer - Mechanical"
## [21] "Executive"
## [22] "Fireman"
## [23] "Flight Attendant"
## [24] "Food Service"
## [25] "Food Service Management"
## [26] "Homemaker"
## [27] "Investor"
## [28] "Judge"
## [29] "Laborer"
## [30] "Landscaping"
## [31] "Medical Technician"
## [32] "Military Enlisted"
## [33] "Military Officer"
## [34] "Nurse (LPN)"
## [35] "Nurse (RN)"
## [36] "Nurse's Aide"
## [37] "Other"
## [38] "Pharmacist"
## [39] "Pilot - Private/Commercial"
## [40] "Police Officer/Correction Officer"
## [41] "Postal Service"
## [42] "Principal"
## [43] "Professional"
## [44] "Professor"
## [45] "Psychologist"
## [46] "Realtor"
## [47] "Religious"
## [48] "Retail Management"
## [49] "Sales - Commission"
## [50] "Sales - Retail"
## [51] "Scientist"
## [52] "Skilled Labor"
## [53] "Social Worker"
## [54] "Student - College Freshman"
## [55] "Student - College Graduate Student"
## [56] "Student - College Junior"
## [57] "Student - College Senior"
## [58] "Student - College Sophomore"
## [59] "Student - Community College"
## [60] "Student - Technical School"
## [61] "Teacher"
## [62] "Teacher's Aide"
## [63] "Tradesman - Carpenter"
## [64] "Tradesman - Electrician"
## [65] "Tradesman - Mechanic"
## [66] "Tradesman - Plumber"
## [67] "Truck Driver"
## [68] "Waiter/Waitress"
## [1] "$0" "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
## [1] "" "Employed" "Full-time" "Not available"
## [5] "Not employed" "Other" "Part-time" "Retired"
## [9] "Self-employed"
## Term LoanStatus ClosedDate
## Min. :12.00 Current :56576 :58848
## 1st Qu.:36.00 Completed :38074 2014-03-04 00:00:00: 105
## Median :36.00 Chargedoff :11992 2014-02-19 00:00:00: 100
## Mean :40.83 Defaulted : 5018 2014-02-11 00:00:00: 92
## 3rd Qu.:36.00 Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Max. :60.00 Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation IncomeRange IncomeVerifiable
## Other :28617 $25,000-49,999:32192 False: 8669
## Professional :13628 $50,000-74,999:31050 True :105268
## Computer Programmer : 4478 $100,000+ :17337
## Executive : 4311 $75,000-99,999:16916
## Teacher : 3759 Not displayed : 7741
## Administrative Assistant: 3688 $1-24,999 : 7274
## (Other) :55456 (Other) : 1427
## StatedMonthlyIncome CreditScoreRangeLower ProsperScore
## Min. : 0 Min. : 0.0 Min. : 1.00
## 1st Qu.: 3200 1st Qu.:660.0 1st Qu.: 4.00
## Median : 4667 Median :680.0 Median : 6.00
## Mean : 5608 Mean :685.6 Mean : 5.95
## 3rd Qu.: 6825 3rd Qu.:720.0 3rd Qu.: 8.00
## Max. :1750003 Max. :880.0 Max. :11.00
## NA's :591 NA's :29084
## EmploymentStatus EmploymentStatusDuration CurrentCreditLines
## Employed :67322 Min. : 0.00 Min. : 0.00
## Full-time :26355 1st Qu.: 26.00 1st Qu.: 7.00
## Self-employed: 6134 Median : 67.00 Median :10.00
## Not available: 5347 Mean : 96.07 Mean :10.32
## Other : 3806 3rd Qu.:137.00 3rd Qu.:13.00
## : 2255 Max. :755.00 Max. :59.00
## (Other) : 2718 NA's :7625 NA's :7604
## TotalCreditLinespast7years DebtToIncomeRatio BorrowerRate
## Min. : 2.00 Min. : 0.000 Min. :0.0000
## 1st Qu.: 17.00 1st Qu.: 0.140 1st Qu.:0.1340
## Median : 25.00 Median : 0.220 Median :0.1840
## Mean : 26.75 Mean : 0.276 Mean :0.1928
## 3rd Qu.: 35.00 3rd Qu.: 0.320 3rd Qu.:0.2500
## Max. :136.00 Max. :10.010 Max. :0.4975
## NA's :697 NA's :8554
## LoanOriginalAmount LoanOriginationDate
## Min. : 1000 2014-01-22 00:00:00: 491
## 1st Qu.: 4000 2013-11-13 00:00:00: 490
## Median : 6500 2014-02-19 00:00:00: 439
## Mean : 8337 2013-10-16 00:00:00: 434
## 3rd Qu.:12000 2014-01-28 00:00:00: 339
## Max. :35000 2013-09-24 00:00:00: 316
## (Other) :111428
##
## 12 36 60
## 1614 87778 24545
Most loans have a 36-month term (87778 entries), and there are few 12-month loans compared with 36- and 60-month loans. Surprisingly, there are no 48-month loans: borrowers chose a term of 1, 3 or 5 years.
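To put the term counts on a common scale, they can be converted to percentages. The sketch below copies the three counts from the table above; the variable names `term_counts` and `term_share` are chosen for this example only.

```r
# Counts copied from the Term table above.
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

# Convert counts to percentages of all loans, rounded to one decimal.
term_share <- round(100 * prop.table(term_counts), 1)
term_share
```

With these counts, 36-month loans make up roughly 77% of the data set.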
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
There are 11992 charged-off and 5018 defaulted loans, so around 15% of all loans have defaulted or are very likely to default, which seems high.
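The share of charged-off and defaulted loans can be checked directly from the counts in the table above. This is only a small arithmetic sketch; the variable names are chosen for the example.

```r
# Counts copied from the LoanStatus table above.
status_counts <- c(Chargedoff = 11992, Defaulted = 5018)
total_loans <- 113937  # number of rows in the data set

# Share of all loans that were charged off or defaulted.
bad_share <- sum(status_counts) / total_loans
round(100 * bad_share, 1)
```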
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
More than half of the loans are in the Debt Consolidation category. Excluding the Not Available and Other categories, the next highest counts are in Home Improvement and Business.
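Because the listing category is stored as a numeric code, it helps to attach labels before reading the table. The sketch below does this for the five largest codes only. The code-to-label mapping is an assumption based on the category names used in the text and should be checked against the data dictionary; the counts are copied from the table above.

```r
# Counts for the five largest codes, copied from the table above.
counts <- c("0" = 16965, "1" = 58308, "2" = 7433, "3" = 7189, "7" = 10494)

# Assumed mapping from numeric code to category name (verify against the
# data dictionary before relying on it).
labels <- c("0" = "Not Available", "1" = "Debt Consolidation",
            "2" = "Home Improvement", "3" = "Business", "7" = "Other")

# Replace the numeric codes with readable names.
names(counts) <- unname(labels[names(counts)])

# Percentage of all 113937 loans in each of these categories.
share <- round(100 * counts / 113937, 1)
sort(share, decreasing = TRUE)
```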
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
The histogram looks roughly bell-shaped: most Prosper scores are between 4 and 8, while few are below 2 or above 10.
##
## Accountant/CPA
## 3588 3233
## Administrative Assistant Analyst
## 3688 3602
## Architect Attorney
## 213 1046
## Biologist Bus Driver
## 125 316
## Car Dealer Chemist
## 180 145
## Civil Service Clergy
## 1457 196
## Clerical Computer Programmer
## 3164 4478
## Construction Dentist
## 1790 68
## Doctor Engineer - Chemical
## 494 225
## Engineer - Electrical Engineer - Mechanical
## 1125 1406
## Executive Fireman
## 4311 422
## Flight Attendant Food Service
## 123 1123
## Food Service Management Homemaker
## 1239 120
## Investor Judge
## 214 22
## Laborer Landscaping
## 1595 236
## Medical Technician Military Enlisted
## 1117 1272
## Military Officer Nurse (LPN)
## 346 492
## Nurse (RN) Nurse's Aide
## 2489 491
## Other Pharmacist
## 28617 257
## Pilot - Private/Commercial Police Officer/Correction Officer
## 199 1578
## Postal Service Principal
## 627 312
## Professional Professor
## 13628 557
## Psychologist Realtor
## 145 543
## Religious Retail Management
## 124 2602
## Sales - Commission Sales - Retail
## 3446 2797
## Scientist Skilled Labor
## 372 2746
## Social Worker Student - College Freshman
## 741 41
## Student - College Graduate Student Student - College Junior
## 245 112
## Student - College Senior Student - College Sophomore
## 188 69
## Student - Community College Student - Technical School
## 28 16
## Teacher Teacher's Aide
## 3759 276
## Tradesman - Carpenter Tradesman - Electrician
## 120 477
## Tradesman - Mechanic Tradesman - Plumber
## 951 102
## Truck Driver Waiter/Waitress
## 1675 436
There are a lot of professions here. I removed the dominant Other and Professional entries, but with this many categories the x axis is still unreadable, so I flip the axes to show the full occupation names on the y axis.
Among the remaining occupations, Computer Programmer is the most common.
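The filtering and axis flip described above can be sketched like this. It is not the original plotting code: it uses only six counts copied from the Occupation table, the object names are made up for the example, and the plot is drawn only if ggplot2 is installed.

```r
# A handful of counts copied from the Occupation table above.
occupation_counts <- data.frame(
  Occupation = c("Other", "Professional", "Computer Programmer",
                 "Executive", "Teacher", "Administrative Assistant"),
  n = c(28617, 13628, 4478, 4311, 3759, 3688),
  stringsAsFactors = FALSE
)

# Drop the two catch-all categories so the specific occupations are visible.
specific <- subset(occupation_counts,
                   !(Occupation %in% c("Other", "Professional")))

# Order the levels by count so the largest bar ends up on top after flipping.
specific$Occupation <- reorder(specific$Occupation, specific$n)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(specific, aes(x = Occupation, y = n)) +
    geom_col() +
    coord_flip() +  # occupation names move to the y axis and stay readable
    labs(x = "Occupation", y = "Number of loans")
  print(p)
}
```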
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 591 rows containing non-finite values (stat_bin).
There are a few outliers with a credit score of 0, which need to be removed.
##
## 0 360 420 440 460 480 500 520 540 560 580 600
## 133 1 5 36 141 346 554 1593 1474 1357 1125 3602
## 620 640 660 680 700 720 740 760 780 800 820 840
## 4172 12199 16366 16492 15471 12923 9267 6606 4624 2644 1409 567
## 860 880
## 212 27
Most people are in the 650 - 750 range. It would be interesting to compare loan status with credit score: people with lower scores should have a higher share of defaulted or charged-off loans.
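Removing the zero scores is a one-line filter. The sketch below uses the first ten `CreditScoreRangeLower` values from the `str()` output, plus one 0 and one `NA` that are added here purely to illustrate the cleaning step; the variable names are assumptions.

```r
# First ten scores from the str() output, plus a 0 and an NA that are
# added only to demonstrate the filter.
credit_score <- c(640, 680, 480, 800, 680, 740, 680, 700, 820, 820, 0, NA)

# Keep real scores only: drop missing values and the placeholder 0.
clean_score <- credit_score[!is.na(credit_score) & credit_score > 0]

table(clean_score)
range(clean_score)
```

Checking for `NA` explicitly matters here, because `credit_score > 0` alone would carry the missing value through the subset.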
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To get a general idea of where most people lie, the binwidth needs to be increased.
Most people earn between 3000 and 6000 per month. I expect monthly income to correlate strongly with credit score, which I check later.
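Choosing an explicit bin width instead of the default 30 bins might look like the sketch below. The incomes are synthetic (drawn from a log-normal distribution just to have right-skewed data), and both the 15000 cut-off and the 1000-wide bins are illustrative assumptions, not the values used in the report.

```r
set.seed(1)
# Synthetic right-skewed incomes, used only to make the example runnable.
income <- data.frame(
  StatedMonthlyIncome = round(rlnorm(1000, meanlog = 8.4, sdlog = 0.6))
)

# Focus on the bulk of the distribution before choosing a bin width.
income_main <- subset(income, StatedMonthlyIncome < 15000)

# Base R equivalent of an explicit bin width: fixed 1000-wide breaks.
bins <- hist(income_main$StatedMonthlyIncome,
             breaks = seq(0, 15000, by = 1000), plot = FALSE)
bins$counts

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(income_main, aes(x = StatedMonthlyIncome)) +
    geom_histogram(binwidth = 1000)  # silences the `bins = 30` message
  print(p)
}
```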
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
##
## The following object is masked from 'package:memisc':
##
## is.interval
##
## The following object is masked from 'package:base':
##
## date
The highest number of loans was closed in 2014, while from 2010 to 2013 the number remained roughly constant.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 385 1351 2467 3553 4804 6367 7449 8945 8985 8731 8152 7500 6530 5677 4927
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 3985 3468 2619 2242 1730 1377 1068 828 670 563 446 348 251 205 145
## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 119 91 75 62 39 40 34 23 23 13 10 8 3 1 4
## 45 46 47 48 51 52 54 56 59
## 3 1 3 3 1 3 3 2 1
From the plot, most people have around 7 - 12 open credit lines, with some having as many as 59.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8554 rows containing non-finite values (stat_bin).
There are big outliers in the plot that need to be cleaned, and the binwidth needs to be lowered to get a better plot.
For the majority, the debt-to-income ratio lies around 0.25.
The main features of interest in this dataset are Loan Status, Prosper Score and Credit Score, and the relations between them. I expect the probability of someone defaulting on a loan to be directly correlated with the borrower's Prosper score and credit score.
Other features of interest are the income range and the employment status: someone with a good credit score and a good Prosper score may have been unemployed for a while, and their loan may soon default.
Yes, I added a closed-date year variable. It should help me visualize in which year most loans were closed, and then subset the data to see how many of these closed loans were defaulted, cancelled or charged off. I predict that during the recession, around 2008 - 2010, the ratio of completed loans was lower than in later years.
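Deriving the year from `ClosedDate` can be sketched in base R as follows. The two timestamps are levels shown in the output above; the names `closed_date`, `closed_on` and `ClosedYear` are assumptions for this example. The report loads lubridate, whose `year()` function would do the same job.

```r
# Values in the format shown for ClosedDate above; "" marks a loan that
# has no closing date yet.
closed_date <- c("", "2005-11-25 00:00:00", "2014-03-04 00:00:00")

# Parse with an explicit format so empty strings become NA.
closed_on <- as.Date(closed_date, format = "%Y-%m-%d %H:%M:%S")

# Extract the calendar year as an integer.
ClosedYear <- as.integer(format(closed_on, "%Y"))
ClosedYear
```

Loans without a closing date end up with `NA` as their year, so they drop out of any per-year count automatically.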
There were some outliers in the credit score range that I had to clean up to get a good view of the ranges. For the occupations, the large number of categories meant I needed to flip the axes to get a readable plot. For the listing category plot, my initial thought was that auto loans or home loans would be the most common, but the plot showed that most loans were for debt consolidation, which was quite surprising. I went through some articles on this, and people state that a debt consolidation loan has a higher probability of default than other loans, yet I still see a very high number in that category.
First I want to compare credit scores with monthly income to see whether there is any correlation between them.
## Warning: Removed 591 rows containing missing values (geom_point).
This plot does not give the correct picture because of various outliers. I should probably consider only monthly incomes below 10000 and also remove rows with a credit score of 0.
There are too many points. To see where the scatter plot is densest, I need to add alpha blending.
I can see a relation between the two that seems linear. Adding a smoothing line should show more clearly where the trend is heading.
Now we can see a clear smooth line with a linear trend. The correlation between the two is positive at 0.22, which seems low to me: logically these two variables should be much more strongly correlated.
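The steps above (trim the outliers, add alpha blending, add a smoothing line, compute the correlation) can be sketched as follows. The data is synthetic and the two columns are generated independently, so the correlation printed here will be near zero and will not match the 0.22 reported above. Only the column names and the two cut-offs come from the text; the alpha value of 1/20 is an assumption.

```r
set.seed(42)
n <- 2000
# Synthetic stand-in data; only the column names follow the report.
loans <- data.frame(
  CreditScoreRangeLower = sample(c(0, seq(520, 880, by = 20)), n, replace = TRUE),
  StatedMonthlyIncome = round(rlnorm(n, meanlog = 8.4, sdlog = 0.6))
)

# Drop the outliers named in the text: incomes of 10000 or more and zero scores.
trimmed <- subset(loans, StatedMonthlyIncome < 10000 & CreditScoreRangeLower > 0)

# Pearson correlation on the trimmed data; rows with NA would be skipped.
r <- cor(trimmed$CreditScoreRangeLower, trimmed$StatedMonthlyIncome,
         use = "complete.obs")
r

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(trimmed, aes(x = CreditScoreRangeLower, y = StatedMonthlyIncome)) +
    geom_point(alpha = 1 / 20) +  # 20 overlapping points give a solid mark
    geom_smooth(method = "lm")    # linear trend line
  print(p)
}
```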
Next we can compare Prosper Score with the Credit Score.
The plot shows similar characteristics to the Credit Score vs Monthly Income comparison: there are many points, and only after smoothing can we see a mostly linear relation between the two variables. At the top end, however, even people with higher credit scores had lower Prosper scores. The correlation between the two is 0.37.
Now I want to compare Loan Status with Credit Score. I will use a binwidth of 20 to group credit scores together for a clearer bar graph, and I will remove rows where LoanStatus is Current.
## LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 515 580 595 660 720 1
## --------------------------------------------------------
## LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 600.0 660.0 648.9 700.0 860.0 48
## --------------------------------------------------------
## LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 640.0 680.0 685.6 740.0 880.0 416
## --------------------------------------------------------
## LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 698.7 720.0 880.0
## --------------------------------------------------------
## LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 560.0 640.0 620.9 680.0 860.0 126
## --------------------------------------------------------
## LoanStatus: FinalPaymentInProgress
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 700.4 740.0 820.0
## --------------------------------------------------------
## LoanStatus: Past Due (>120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 640.0 675.0 680.0 687.5 700.0 780.0
## --------------------------------------------------------
## LoanStatus: Past Due (1-15 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 687.6 720.0 860.0
## --------------------------------------------------------
## LoanStatus: Past Due (16-30 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 682.2 720.0 820.0
## --------------------------------------------------------
## LoanStatus: Past Due (31-60 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 691.5 720.0 820.0
## --------------------------------------------------------
## LoanStatus: Past Due (61-90 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 688.8 720.0 820.0
## --------------------------------------------------------
## LoanStatus: Past Due (91-120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 690.2 720.0 820.0
At lower credit scores there are more Chargedoff and Defaulted loans relative to Current and Completed ones. At higher scores the share of completed and current loans is higher.
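Per-status summaries like the ones printed above can be produced with `by()`, and `tapply()` gives a compact vector of medians for comparison. The sketch below runs on a small hypothetical data frame; the rows are invented for illustration and only the column names follow the report.

```r
# Hypothetical toy rows; only the column names follow the report.
loans <- data.frame(
  LoanStatus = c("Completed", "Completed", "Completed", "Chargedoff",
                 "Chargedoff", "Defaulted", "Defaulted", "Current"),
  CreditScoreRangeLower = c(680, 740, 640, 600, 660, 560, 640, 700),
  stringsAsFactors = FALSE
)

# Exclude loans that are still running, as described in the text.
finished <- subset(loans, LoanStatus != "Current")

# One five-number summary per loan status.
by(finished$CreditScoreRangeLower, finished$LoanStatus, summary)

# Median score per status as a named vector, easier to compare at a glance.
median_score <- tapply(finished$CreditScoreRangeLower, finished$LoanStatus, median)
median_score
```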
Next I compare LoanStatus with the year to see whether there is any relation between them. I am looking for whether the recession after 2008 caused a jump in the number of defaulted or charged-off loans.
We can see that the number of defaulted and charged-off loans rose suddenly after 2007 and stayed high until 2010; after that, the ratio is much lower.
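Since the number of loans differs a lot between years, comparing ratios is easier with row-wise proportions than with raw counts. A minimal sketch on hypothetical rows (the data and the `ClosedYear` column name are assumptions for this example):

```r
# Hypothetical toy rows for illustration only.
loans <- data.frame(
  ClosedYear = c(2007, 2007, 2008, 2008, 2008, 2012, 2012, 2012),
  LoanStatus = c("Completed", "Defaulted", "Chargedoff", "Defaulted",
                 "Completed", "Completed", "Completed", "Chargedoff"),
  stringsAsFactors = FALSE
)

# Cross-tabulate year against status.
status_by_year <- table(loans$ClosedYear, loans$LoanStatus)

# margin = 1 normalizes each row, so every year sums to 1.
row_share <- prop.table(status_by_year, margin = 1)
round(row_share, 2)
```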
I will compare states and Loan Status and see if there is some relation we can find here.
Nothing stands out here: the ratio of loan statuses seems to be about the same for each state.
I will now compare Listing category with the Loan Status.
A lot of loans are for debt consolidation, and the debt consolidation and unknown categories also have the highest numbers of defaulted and charged-off loans. However, I could not find much of a difference between categories because the ratio is almost the same.
Next I want to compare credit score with the number of credit lines.
## loanData$CreditScoreRangeLower: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 133
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 360
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 420
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 5
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 440
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 36
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 460
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 141
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 480
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 346
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 500
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 554
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 520
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.00 4.00 5.01 7.00 30.00 442
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 540
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 6.851 9.000 31.000 736
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 560
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.000 7.500 8.484 12.000 29.000 359
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 580
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.000 8.000 9.236 13.000 41.000 281
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 600
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 9.00 9.49 13.00 45.00 604
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 620
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 9.000 9.798 13.000 52.000 537
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 640
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.000 8.000 9.102 12.000 48.000 621
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 660
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 9.000 9.697 12.000 54.000 398
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 680
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.57 13.00 54.00 425
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 700
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.97 14.00 59.00 283
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 720
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 7.00 10.00 11.09 14.00 56.00 312
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 740
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.99 14.00 51.00 217
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 760
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 7.00 10.00 10.98 14.00 44.00 180
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 780
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 7.00 10.00 11.07 14.00 43.00 151
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 800
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 7.00 10.00 10.93 14.00 38.00 115
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 820
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 7.00 10.00 10.96 13.00 39.00 72
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 840
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 7.00 10.00 10.61 13.00 32.00 34
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 860
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 7.000 9.000 9.838 12.000 26.000 27
## --------------------------------------------------------
## loanData$CreditScoreRangeLower: 880
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.00 7.00 8.50 8.75 10.00 18.00 3
The borrowers with the most credit lines sit around the 700 score range, and the maximum number of open lines falls steadily for scores above 750, although the median stays at about 10.
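The pattern can be read off the summaries above by collecting the Max. and Median columns into vectors. The values below are copied from that output for the score bands that have data (520 to 880); the variable names are chosen for this example.

```r
# Credit score bands that have non-missing credit line data in the output.
score <- seq(520, 880, by = 20)

# Max. and Median of CurrentCreditLines per band, copied from the output.
max_lines <- c(30, 31, 29, 41, 45, 52, 48, 54, 54, 59,
               56, 51, 44, 43, 38, 39, 32, 26, 18)
median_lines <- c(4, 6, 7.5, 8, 9, 9, 8, 9, 10, 10,
                  10, 10, 10, 10, 10, 10, 10, 9, 8.5)
names(max_lines) <- score

# Score band with the highest maximum number of credit lines.
names(which.max(max_lines))

# Range of score bands where the median is 10 credit lines.
range(score[median_lines == 10])
```

The maximum peaks in the 700 band, while the median sits at 10 for every band from 680 to 840.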
I should also take a look into comparison of DebtToIncomeRatio and LoanStatus.
Drawing everything in one graph lets me deduce nothing, so I used a facet wrap to see how the curve changes in each individual category. Comparing the Defaulted and Completed curves, the Completed curve rises steeply towards the 0.20 range and then falls steeply, whereas the Defaulted and even Chargedoff curves are less steep. This suggests that defaulted and charged-off loans are spread over a wider range of debt-to-income ratios instead of being concentrated around low ratios.
Comparing Employment Status and Loan Status.
In this case, for the Employed and Full-time employment statuses, loans were mostly charged off rather than defaulted.
Next I want to see the status of loans started in a given year. For this I need to create a LoanYear variable that gives the year in which the loan started.
We can see that most of the defaulted and charged-off loans are from 2006 - 2008, and the number decreased a lot in later years. There is also a major drop in loans initiated in 2009, which might be due to the economy still recovering from the recession.
Next I want to compare Credit Lines with Monthly Debt and analyse their correlation. To get monthly debt, I need to multiply the debt-to-income ratio by the monthly income.
## Warning: Removed 16034 rows containing missing values (geom_point).
I need to remove some outliers where monthly debt is far higher than usual. I will keep monthly debt below 8000, as most observations are within that range, and credit lines below 40.
There is a fairly high correlation between these two variables, around 0.60, so people with higher debt generally tend to have more credit lines open.
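The derived monthly debt variable and the outlier filter can be sketched as below. The five rows are the first values shown in the `str()` output (incomes as printed there, i.e. rounded), so the resulting correlation is only illustrative and will not equal the figure reported above. `MonthlyDebt` is an assumed column name.

```r
# First five rows as shown in the str() output above.
loans <- data.frame(
  StatedMonthlyIncome = c(3083, 6125, 2083, 2875, 9583),
  DebtToIncomeRatio = c(0.17, 0.18, 0.06, 0.15, 0.26),
  CurrentCreditLines = c(5, 14, NA, 5, 19)
)

# Monthly debt = debt-to-income ratio times monthly income.
loans$MonthlyDebt <- loans$DebtToIncomeRatio * loans$StatedMonthlyIncome

# Keep the ranges named in the text; subset() also drops rows where the
# condition is NA (here the row with missing credit lines).
trimmed <- subset(loans, MonthlyDebt < 8000 & CurrentCreditLines < 40)

# Pearson correlation between monthly debt and open credit lines.
r <- cor(trimmed$MonthlyDebt, trimmed$CurrentCreditLines)
r
```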
The correlation between Borrower Rate and Credit Score is -0.488, which seems logical, as people with higher credit scores get lower loan rates.
The correlation between Borrower Rate and Loan Amount is -0.33, the opposite of what I expected: larger amounts should carry higher rates because of the higher risk involved.
There is a positive correlation of 0.35 between Credit Score and Loan Amount: people with higher credit scores took larger loans.
A lot of loans were given at an interest rate of about 0.325 compared with neighbouring rates.
Loan amounts decreased in 2009 and have been increasing since; they are considerably higher in 2013 and 2014 than in other years.
Higher income groups took out higher loan amounts, which seems intuitive.
The strongest relationship was between Monthly Debt and the number of Current Credit Lines for an individual.
Borrowers with higher credit scores receive higher loan amounts and lower borrower rates. For better visualization, the data needs to be restricted to credit scores greater than 680.
In the $100,000+, $25,000-49,999, $50,000-74,999 and $75,000-99,999 ranges we can see a definite increase in loan amounts, but we see a drop in loan amounts for people who are not employed and no increase in later years for the $1-24,999 range.
People with a higher debt-to-income ratio and a higher credit score have a lower Prosper score, while people with a higher credit score but a lower debt-to-income ratio have a higher Prosper score. This explains the low correlation between Prosper and credit scores.
Monthly debt and current credit lines are fairly strongly correlated. After adding credit score ranges, people with lower credit scores have lower monthly debt; moving up the plot the results are mixed, but people with higher credit scores and more credit lines do have higher monthly debt as well.
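Credit score ranges of this kind can be built with `cut()` and then mapped to colour in a plot. The sketch below uses the first ten scores from the `str()` output; the break points and labels are hypothetical and are not the ranges used in the report.

```r
# First ten CreditScoreRangeLower values from the str() output.
credit_score <- c(640, 680, 480, 800, 680, 740, 680, 700, 820, 820)

# Hypothetical break points; right = FALSE makes each band include its
# lower bound, e.g. 680 falls into "680-739".
score_band <- cut(credit_score,
                  breaks = c(0, 600, 680, 740, 900),
                  labels = c("below 600", "600-679", "680-739", "740+"),
                  right = FALSE)

table(score_band)
```

The result is an ordinary factor, so it can be used directly as a colour or facet variable.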
Looking at loan amounts and borrower rates, people with higher credit scores were definitely given lower interest rates even for higher loan amounts. Another relation also emerged: the higher the income, the higher the loan amount.
To look further into the low correlation between Prosper score and credit score, I added the debt-to-income ratio to this plot. This explained much of the relation between the two variables: a higher debt-to-income ratio lowers the Prosper score even for people with a higher credit score.
There are changes from year to year. In 2006 most people were given low rates, and in 2007 rates increased a little. In 2009 and 2010 people were taking lower loan amounts but still getting higher rates. Rates started going down after 2011, and loan amounts increased through 2011, 2012 and 2013; in 2014 rates dropped and amounts increased quite a lot. Before 2014 there is also a clear demarcation in the rates given to people with higher credit scores, but that demarcation has vanished in 2014.
Loan amounts increased in 2013 and 2014, yet the debt-to-income ratio is much lower. In 2007 and 2008, by contrast, many people have a worse debt-to-income ratio, closer to 1. In 2009 and 2010 most people maintained a good debt-to-income ratio for their loans, and it gets worse again in 2011 and 2012. I can also deduce that people in the low income range of $1-24,999 have a poor debt-to-income ratio, and that loan amounts rise as we go up the income ranges.
People in the $1-24,999 and $25,000-49,999 income ranges have a higher debt-to-income ratio and so mostly have a lower Prosper score. In income ranges above 50k the cluster gets darker towards the right of the plot, and people have a lower debt-to-income ratio as well.
The loan data set had about 114,000 loan observations for the years 2006 to 2014 with 81 variables; for this problem set I chose 22 variables for analysis. The main difficulty at first was choosing a correct and smaller dataset for my work. I wanted to keep the dataset small, so I went in with 15 variables at first; while going through the analysis I looked at more variables and found that some might provide better analysis or have a better correlation with the other variables in the set.
I would have liked to build a model on the data, but I think I need to take the next courses before I can implement that.
Across multiple plots I could see that the company was struggling with loans, and the situation was worse during the recession years, when it was probably too lenient in giving out loans even to people with a bad debt-to-income ratio. I think it has recovered from that since 2012, where I can see that it has applied some strictness and is giving out good loans only.